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Abstract 

Protein structures are much more conserved than sequences during evolution. Based on this observa- 
tion, we investigate the consequences of structural conservation on protein evolution. We study seven of 
the most studied protein folds, finding out that an extended neutral network in sequence space is associ- 
I ■ ated to each of them. Within our model, neutral evolution leads to a non-Poissonian substitution process, 
due to the broad distribution of connectivities in neutral networks. The observation that the substitution 
process has non-Poissonian statistics has been used against the original Kimura's neutral theory, while 
our model shows that this is a generic property of neutral evolution with structural conservation. Our 
model also predicts that the substitution rate can strongly fluctuate from one branch to another of the 
evolutionary tree. The average sequence similarity within a neutral network is close to the threshold of 
randomness, as observed for families of sequences sharing the same fold. Nevertheless, some positions 
are more difficult to mutate than others. We compare such structurally conserved positions to positions 
conserved in protein evolution, suggesting that our model can be a valuable tool to distinguish structural 
from functional conservation in databases of protein families. These results indicate that a synergy be- 
tween database analysis and structurally-based computational studies can increase our understanding of 
protein evolution. 



Keywords: Neutral evolution, Non-Poissonian substitution porcess, Conserved protein residues 



INTRODUCTION 

The advent of large scale genome projects 
is transforming the field of molecular evolution 
(Koonin et al, 2000). The molecular mechanisms 
of evolution are becoming increasingly amenable 
to direct observation (Henikoff et al, 1997; Ger- 
stein, 1998: Thornton et al, 1999). and it has 
become possible to study molecular evolution not 
only in the context of population genetics, but also 
by considering the thermodynamic stability of the 
biomolcculcs involved in evolution. This "struc- 
tural" approach has been pioneered by Schuster 
and co-workers, with a series of studies of neutral 
networks of RNA secondary structures (Schuster et 
al., 1994; Huynen et al., 1996; Fontana & Schuster, 
1998) and it has been applied to proteins by sev- 
eral groups (Shakhnovich et al., 1996; Bornberg- 
Bauer, 1997; Bornberg-Bauer & Chan, 1999; Baba- 
jide et al., 1997; Govindarajan & Goldstein, 1997, 
1998; Bussemaker et al, 1997; Tiana et al, 1998; 
Mirny & Shakhnovich, 1999; Bastolla et al, 1999, 
2000b; Dokholyan & Shakhnovich, 2001). Despite 
these stimulating studies, however, the structural 
approach has not yet been used to investigate the 
classical issues in molecular evolution, as we set 
out to do here. 

In this work we apply to seven of the most stud- 
ied protein folds the structurally constrained neu- 
tral model (SCN) which three of us recently in- 
troduced in the context of lattice models (Bas- 
tolla et al, 1999; Bastolla et al, 2000b). We 
compare qualitatively the substitution process ob- 
tained from the SCN model and that observed in 
protein sequence evolution. The SCN model is 
based on the observation that evolution conserves 
protein structure much more than protein sequence 
(Holm & Sander, 1996; Rost, 1997). We assume 
that all mutations conserving the structure have 
the same probability of being fixed, thus resulting 
in a neutral model. The main reason to introduce 
structure conservation as a working hypothesis is 
the experimental observation that many mutations 
do not modify significantly the activity of a protein 
and its thermodynamic stability, while mutations 
substantially improving protein functionality are 
rare (Orencia et al, 2001). 

The neutral theory of molecular evolution was 
introduced in the late 60s by Kimura (1968) and 
by King and Jukes (1969) to explain the high sub- 
stitution rates observed in vertebrates as well as 
the large amount of intra-specific genetic variation 
observed in most species. According to neutral- 
ism, most amino acid substitutions arc fixed in 



the genome of a species not because they bring 
a selective advantage but due to random genetic 
drift acting on alleles of equivalent selective value. 
Kimura's neutral model predicts that the rate of 
amino acid substitution of a given protein is ap- 
proximately constant for different species within 
major evolutionary groups, independent of the 
number of individuals and ecology of the species. 
This was in agreement with earlier observations on 
protein evolution which lead to postulate a kind of 
"molecular clock" (Zuckcrkandl & Pauling, 1962). 

According to Kimura's theory, the fraction of 
amino acids neutrally substituted in a time T is a 
Poissonian variable of expectation value kT where 
k, the substitution rate, is different for different 
proteins but does not change for different species. 
A subsequent study by Ohta and Kimura (1971) 
measured the variance of the substitution process 
acting on different species, finding that it is larger 
than the mean, i.e. the process is non-Poissonian. 
Such a result was confirmed by the more sophis- 
ticated analyses by Langlcy and Fitch (1973) and 
Gillespie (1989). This and other observations lead 
first Ohta and then Kimura to adopt, in place 
of the original neutral model, a model based on 
slightly deleterious mutations (Ohta, 1976), and 
Gillespie to reject in toto the neutral theory favour- 
ing the hypothesis that most substitutions in pro- 
tein sequences are fixed by positive selection (Gille- 
spie, 1991). Takahata, however, showed that an ex- 
tension of the neutral theory, the fluctuating neu- 
tral space model (Takahata, 1987), accounts for 
the non-Poissonian statistics of substitutions. 

One of the goals of this study is to investigate 
the consequences of structural conservation on the 
properties of neutral networks and on the substi- 
tution process associated to them. We show that 
neutral evolution docs not lead to a Poissonian 
substitution process. This result complements 
the fluctuating neutral space model by Takahata 
(1987), and suggests that arguments against the 
neutral theory based on the fact that the substi- 
tution statistics is non-Poissonian (Gillespie, 1991) 
are not conclusive. A deeper understanding of the 
mechanism of neutral evolution will help to single 
out the perhaps less common but more interesting 
cases of positive selection as, for instance, func- 
tional changes and responses to changes in the en- 
vironment. It can also be useful for calibrating the 
molecular clocks used to reconstruct phylogcnctic 
trees, whose reliability is severely limited by the 
fluctuations of the substitution rate (Ayala, 1997). 

Another interesting application of the SCN 
model is the possibility to distinguish between 



functional and structural conservation. By sim- 
ulating neutral evolution we identify the key po- 
sitions which are more difficult to mutate. We 
identify them as structurally conserved positions, 
and those positions conserved in actual evolution 
but not in the SCN model as functionally con- 
served ones. Practically all residues whose func- 
tional role is known belong to this class. Most 
positions are not conserved in the SCN model, as 
similarly observed in actual evolution data. We 
identify them as neutrally evolving positions, and 
argue that their preeminence is an evidence of the 
importance of neutral evolution. Finally, a small 
number of positions appear structurally important 
in the SCN model but are not significantly con- 
served. This could be due to a limitation either of 
the SCN model or of the protein database, but it 
could also be a clue of structural changes, possi- 
bly positively selected. Other methods to identify 
computationally structurally important positions 
have been proposed recently (Kannan & Vishvesh- 
wara, 1999; Cecconi et al., 2001). In particular, an- 
other method based on simulated evolution has ap- 
peared in a recent preprint after this work had been 
completed (Dokholyan & Shakhnovich, 2001). 

As most computational studies of protein evolu- 
tion, the SCN model is based on an approximate 
stability criterion relying on the Z-score (Bowie et 
al., 1991; Goldstein et al, 1992) and on a fold- 
ing parameter measuring the degree of correlation 
of the energy landscape (Bastolla et al., 1999). 
While these parameters can not predict precisely 
the thermodynamic stability of a specific sequence, 
our previous studies show that they correlate with 
the observed stability. Thus, we expect that the 
statistical properties derived from the analysis of 
a large number of sequences capture real features 
of protein evolution. 



STRUCTURALLY CONSTRAINED 
NEUTRAL MODEL 

Following Kimura, we divide mutations in two 
classes: those which result in inactivating the pro- 
tein, which are regarded as lethal and can not 
spread in the population, and those after which 
the protein remains active, which are regarded as 
selectively neutral. This mutational spectrum im- 
plies that protein sequences evolve on a neutral 
network, i.e. a set of sequences where the protein 
is active and which can be connected through point 
mutations. Under this mutational spectrum, fixa- 
tion of slightly deleterious mutations can not take 



place, since these are not included in the model, as 
well as advantageous mutations. This is of course 
an important limitation of neutral models. 

Kimura's neutral model assumes that the rate 
of appearence of neutral mutations is constant 
throughout evolution. In a paper of 1977, how- 
ever, Kimura comments that rate constancy may 
not hold exactly (Kimura, 1977). Our model is not 
based on any assumption on the neutral mutation 
rate. Rather, we compute the effect of mutations 
on protein stability using an effective model of pro- 
tein folding (Bastolla et al., 2000a) which provides 
us with a genotype to phenotype mapping. In this 
respect, the rate of occurrence of neutral muta- 
tions is an outcome of the model. It turns out that 
this rate shows very broad fluctuations through- 
out evolution. As we shall see below, the variance 
in evolutionary rates predicted by our model is in 
qualitative agreement with observations of protein 
evolution (Gillespie, 1991). 

The model does not take into account popula- 
tion dynamics. This is based on the fact that, 
within Kimura's model, the substitution rate is 
not influenced by population size. An extension 
of Kimura's model to take into account small vari- 
ations in the neutral mutation rate confirmed this 
result (Bastolla & Peliti, 1991). However, popula- 
tion size might influence the substitution rate if the 
rate of neutral mutations shows broad fluctuations, 
as observed here. The explicit inclusion of popu- 
lation genetics into the model would be needed to 
investigate this interesting possibility. 

A neutral network is defined starting from a pro- 
tein sequence in the Protein Data Bank (PDB). 
The corresponding protein structure has to remain 
thermodynamically stable during evolution. Ther- 
modynamic stability is evaluated through an effec- 
tive model of protein folding. The folding parame- 
ters of the native structure, computed through the 
model (see Materials and Methods), must be above 
98.5 percent of the value they have in the PDB se- 
quence. Sequences where this condition is met are 
named viable sequences. A neutral network is a set 
of viable sequences which can be connected to the 
starting sequence through point mutations pass- 
ing on other viable sequences. Thus sequences on 
a neutral network share the same protein fold and 
are cvolutionarily connected. For every amino acid 
sequence A in the neutral network we can measure 
the fraction of neutral neighbors x(A), which is the 
fraction of its possible point mutations which are 
viable. 

We model protein evolution at the level of a 
single sequence. During evolution, the sequence 



moves on the neutral network generating an evo- 
lutionary trajectory, i.e. a list of subsequently vis- 
ited sequences belonging to the neutral network. 
In the present context, the only relevant quantity 
is their fraction of neutral neighbors x(A) G (0, 1]. 
Thus an evolutionary trajectory can be represented 
through a very long list x = {xi, X2, ■ ■ •}■ 

The rate of occurrence of a neutral mutation 
starting from sequence A is the product of a con- 
stant mutation rate times the probability x(A) 
that the mutation is viable. Thus the neutral mu- 
tation rate is not constant in the framework of our 
model, and can be explicitly computed by comput- 
ing x(A) for all sequences in an evolutionary tra- 
jectory. The statistics of the substitution process 
can then be obtained by coupling the evolution- 
ary trajectory generated as above to a Poissonian 
mutation process according to the following rules: 

1. Mutation process: The number of muta- 
tions in a time t is a Poissonian variable of 
average fit. 

2. Acceptance process: Given one realiza- 
tion of the evolutionary trajectory and a 
number of mutations k, the conditional prob- 
ability that n of them are accepted is the 
product of n + 1 geometric distributions of 
parameters 1 — xf. 

/ n \ n+1 

Pace (n | A) =11*0 £ IT^-^r^ 
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(1) 



where the {rrij} are all integer numbers be- 
tween zero and k — n satisfying Y^j=i m j = 
k — ii. In other words, the probability that a 
mutation is accepted is x\ = x(Ai) as long 
as the protein sequence is Ai, X2 = x(Aa) as 
long as the sequence is A2, and so on. 

Two kinds of random variables must be distin- 
guished. We indicate by angular brackets the av- 
erage over mutation and acceptance process for a 
given realization of the evolutionary trajectory and 
by an overline the average over evolutionary tra- 
jectories. The variance of the substitution process 
can be decomposed in two components: 



V(St) = V ll (St) + V x (S t ) 



<5 t 2 )-(5 t ) 2 + I (S t )* -(S t ) 



(2) 



The first term, V^ , is the variance of the muta- 
tion and acceptance process, averaged over evo- 
lutionary trajectories. The second term, V Xl is 



the variance of the substitution rate with respect 
to different evolutionary trajectories. This term, 
which is not present in the standard neutral model, 
explains why the variance of the number of sub- 
stitution is typically larger than its mean value, 
contrasting with a Poissonian process. 

If all sequences have the same fraction of neutral 
neighbors x(A) = x, the number of substitutions 
in a branch of length T is Poissonian with mean 
fiTx and the substitution rate is equal to fix as in 
Kimura's model. If V x is not zero, the substitu- 
tion distribution is more complicated and has to 
be computed numerically using the evolutionary 
trajectories simulated. 



NUMERICAL RESULTS 

Folding of random sequences 

As a preliminary analysis, we measured the dis- 
tribution of the folding parameters a and —Z' (see 
Materials and methods) of the native structures 
considered in this work for random sequences of the 
same length of the corresponding PDB sequences. 
On over 20,000 attempts, we always found folding 
parameters much lower than for PDB sequences. 
The only exception was the smallest protein, the 
53-residues rubredoxin, for which a single random 
sequence had one of the stability parameters com- 
parable to that of the PDB sequence, even if still 
smaller than it. This result is consistent with 
the work of Keefe and Szostak who were recently 
able to select ATP binding proteins from a ran- 
dom library of more than 10 4 sequences (Keefe & 
Szostak, 2001). We note that it is possible to eval- 
uate the size of the neutral network from the joint 
distribution of a and Z' , but to this end better 
statistics are needed than those obtained here. 



Connectivity of neutral networks 

For sequences A belonging to the neutral net- 
work, the fraction of neutral neighbors x(A) counts 
the fraction of all possible point mutations of A 
which still fall into the neutral network. We mea- 
sured this quantity for at least 20,000 sequences 
for each fold, finding that it has a broad distribu- 
tion (see Fig. ||). The shape of the distribution is 
qualitatively similar for all of the studied proteins, 
but in the case of cytochrome c, the distribution 
is shifted to lower connectivities. Results for all 
seven folds are summarized in Table 1. 
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FIG. 1. Distribution of the fraction of neutral 
bors for four protein folds. 
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The connectivity landscape a; (A) is locally cor- 
related for distances in sequence space of the order 
of at least ten substitutions. The correlation func- 
tion C{t) = (<*(A t0 )x(A t0+t )> - (x(A)} 2 ) /a 2 x 

between connectivities of two sequences at dis- 
tance of t steps decays similarly for all proteins 
and can be fitted to a stretched exponential C(t) « 
exp (— (i/r) 77 ), with exponents r\ ranging from 0.60 
to 0.66 and correlation lengths r ranging from 1.8 
to 2.8. Thus correlations decay to one tenth after 
about ten substitutions (data not shown). 



Substitution process 

The broadness of the connectivity distribution 
directly implies that the substitutions process fluc- 
tuates more than a Poissonian process, i.e. it is 
overdispersed. We computed average and variance 
of the substitution process numerically, using the 
evolutionary trajectories generated in our simula- 
tions. Results for myoglobin are shown in Fig. 0. 
Notice that the substitution rate {S t )/t is roughly 
constant in time, and the total dispersion index 
R(t) = V(St)/(St) takes values between 1.0 at 
small t and 1.9 at large t, consistent with the value 
R = 1.7 estimated by Kimura for myoglobin. 
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FIG. 2. Moments of the substitution process for 
myoglobin versus mutational time fit. Solid line: av- 
erage substitution rate divided by fi, {St}/ fit. Black 
circles: mutational variance {Sf) — {St} 2 divided by 
the mean (-R M ). White circles: total dispersion in- 
dex, _R M + R x , where R x is the trajectory variance 
JST/ 1 - JSt} 2 divided by JSt}. 

Due to local correlations in sequence space, dif- 
ferent evolutionary trajectories x(Ai) • • ■ i(A„), 
representing different populations, give different 
mean and variance of the substitution process over 
short time scales. This phenomenon produces new 
lineage effects, i.e. apparently varying substitution 
rates in different branches of the phylogenetic tree. 
To illustrate them, we show in Fig. the mean 
{St({x})) and the variance V(St, {x}) for three re- 
alizations of the evolutionary trajectory {x}. Such 
an effect could overshadow the generation time ef- 
fect for replacement substitutions (Britten, 1986; 
Li et at, 1987; Gillespie, 1991). It could also been 
responsible for the wide fluctuations in the substi- 
tution rate for different lineages observed by Ayala 
(1997). 
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FIG. 3. Substitution process for three realizations 
of the evolutionary trajectories of myoglobin. Black 
circles represent variances, no symbols represent mean 
values. 



Sequence similarity 

One may ask whether sequences sharing the 
same fold must have a high level of similarity. To 
investigate this question, we measured the distri- 
bution of sequence similarity for sequences in the 
neutral network obtained through simulation of 
our model as well as for homologous sequences in 
public databases. Similarity between two aligned 
sequences of the same length is defined as the frac- 
tion of positions where the same amino acid ap- 
pears. Results are showed in Fig. || for the globin 
fold. Similar results have been obtained for all 
other folds. 

Sequences in the neutral network (solid line) 
have an average similarity only slightly larger than 
random sequences, for which a Gaussian-like dis- 
tribution of average value 1/20 is expected. This 
confirms a previous finding of three of us for neu- 
tral networks of lattice structures (Bastolla et al., 
1999). As previously observed by Rost (1997), 
the same result holds also for sequences in the 
FSSP family of sequences sharing the same struc- 
ture (Holm & Sander, 1996), which are showed 
as dotted line. Low similarity for sequences with 
the same fold has also been found in a recent 
computational study based on sequence optimiza- 
tion for native protein structures (Dokholyan & 
Shakhnovich, 2001). The dashed line shows the 
similarity distribution for sequences in the PFAM 
database of similar sequences (Bateman et al, 
2000). The PFAM family has on the average a 
much larger similarity. This is in part due to the 
fact that in this case sequence similarity must be 



large enough for the homology to be detected and 
in part to the fact that proteins in a PFAM fam- 
ily are subject to stronger functional conservation 
than proteins in the FSSP family. For the globin 
family, which has been intensively studied, even 
the PFAM similarity is very low. 
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FIG. 4. Sequence similarity distribution for the 

myoglobin fold. 



Residue conservation 

The SCN model identifies structurally conserved 
residues. In this respect, the present results can 
be compared to the model of Shakhnovich and 
coworkers (Shakhnovich et al., 1996; Dokholyan & 
Shakhnovich, 2001) based on sequence optimiza- 
tion (Shakhnovich & Gutin, 1993; Shakhnovich, 
1994), and to the bioinformatic studies of Ptitsyn's 
group (Ptitsyn, 1998; Ptitsyn & Ting, 1999). 

We evaluate the conservation of each position 
measuring its rigidity (see Materials and Meth- 
ods). This was done for each fold using three dif- 
ferent sets of sequences: (i) Sequences obtained 
from simulations of our neutral model; (ii) Homol- 
ogous sequences in the PFAM database (Bateman 
et al., 2000); (iii) Sequences with the same struc- 
ture in the FSSP database (Holm & Sander, 1996). 
The PFAM family often contains orthologous pro- 
teins performing the same function while in the 
FSSP family different functions may be present 
and function conservation plays a less important 
role. Nevertheless, there is usually a good correla- 
tion between the rigidity of a given position evalu- 
ated through the PFAM and FSSP databases. An 
exception is the TIM barrel family, one of the most 
common folds, used to perform different functions, 
each approximately associated to a different PFAM 



family. In this case, the two sets of rigidities show 
no correlation. For this fold, using a bioinformatic 
analysis, Mirny and Shakhnovich (1999) found ev- 
idence of functional conservation (the same func- 
tional positions tend to be conserved in all func- 
tional families, although with different residues in 
each functional family) but could not find evidence 
of structural conservation. 

Before turning to the analysis of conserved po- 
sitions, we observe that there are three reasons 
why sequence databases may tend to overestimate 
structurally based conservation. The first one is 
the small size of databases. The second one is 
the fact that many sequences are evolutionarily re- 
lated: databases usually provide biased samples of 
the tree of life. To reduce these effect, we limit our 
analysis to sequences that do not have similarity 
larger than a threshold gthr which we choose equal 
to 0.85 in order not to reduce too much database 
size, and try to estimate the maximal conservation 
that one would observe with a database of sim- 
ilar size and correlations, in the null hypothesis 
that all positions are equivalent. The third reason 
is that many residues are conserved on functional 
grounds, sometimes even in the FSSP database, 
and it may be difficult to distinguish them from 
structurally conserved residues. 

Conservation in the neutral network only ex- 
presses structural conservation, thus the compari- 
son between rigidities predicted by the SCN model 
and observed in evolution may allow to single out 
functionally conserved positions or positions in- 
volved in interactions with cofactors, which are 
not represented in our model. We tested this for 
two well studied protein families: the globin family 
and the cytochrome c family. In both cases struc- 
turally conserved positions identified by the SCN 
model coincide with those identified in previous 
bioinformatic studies as part of the folding nucleus 
(Ptitsyn, 1998; Ptitsyn & Ting, 1999), and addi- 
tional structurally conserved positions are found. 
For other protein families less is known about func- 
tional residues, but the few ones which are iden- 
tified in the SwissProt file are recognized as such 
by the SCN model. In our analysis of the globin 
family, positions in contact with the heme are not 
regarded as structural, even if the heme plays also 
an important stabilizing role, since interactions be- 
tween amino acids and cofactors are not considered 
in the model, and they are much more specific than 
interactions between amino acids. 

In Fig. we compare the rigidities obtained from 
our model to those measured in the FSSP family 
for the myoglobin fold. Each point represents a po- 



sition on the native structure. The dotted lines in 
the figure are rough estimates of the maximal rigid- 
ity expected in a random situation, i.e. all equiv- 
alent residues and same distribution of similarity 
as in the set examined. Only residues more rigid 
than that are considered significantly conserved. 
Many of the most conserved residues are in contact 
with the heme group (large circles). A notable ex- 
ception is Pro37n, which is strongly conserved and 
not in contact with the Heme group. Although 
the conservation of this residue has not been fully 
explained so far, Ptitsyn and Ting report that it 
may be due to functional reasons (Ptitsyn & Ting, 
1999). The three positions most conserved accord- 
ing to the SCN model coincide with structural po- 
sitions identified in the bioinformatic analysis by 
Ptitsyn and Ting. They are, in order of rigidity, 
Leull5, Trpl4, Metl31. VallO is rather conserved 
both in our model and in the bioinformatic study. 
The remaining two positions identified by Ptitsyn, 
Ilelll and Leul35, are not among the most con- 
served in our model, although they are above the 
average. In addition, there are eight more posi- 
tions significantly conserved in the SCN, whose 
evolutionary conservation is somewhat lower (but 
above the average in all cases except Hisll9). They 
are: Vall3, Vall7, His24, Leu69, Leu76, Hisll9, 
Phel23, Alal34. Interestingly, structurally con- 
served residues form a cluster, so that it has been 
proposed that they play the role of "folding nuclei" 
(Shakhnovich et ai, 1996; Ptitsyn 1998; Mirny & 
Shakhnovich, 2001). A similar situation applies 
also for the case of cytochrome c: two of the posi- 
tions identified in (Ptitsyn, 1998) are the most con- 
served in our model (Phe7, Leu74), another one is 
significantly conserved (Trp77) and the fourth one 
is not present in the structure we choose as refer- 
ence (PDB code 451c). Moreover, there are three 
positions significantly conserved in our model and 
in the FSSP alignment (Tyr27, Ile48, Val66) and 
one conserved in our model but not in the align- 
ment (Gly36). 



* Residues are labeled in the order in which they are 
listed in the PDB file of the structure la6g 
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FIG. 5. Rigidity in the FSSP family versus rigidity 
in the neutral network for the myoglobin fold. 



the SCN model for the two structures in Fig. 
(upper panel). There is a remarkable correlation, 
despite the fact that results are obtained from inde- 
pendent evolutionary runs with different selection 
parameters. In Fig. |7| (lower panel) we compare 
the rigidity observed in the SCN model (for mes- 
ophylic rubredoxin) with the rigidities observed in 
the PFAM and FSSP databases. 
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FIG. 6. Structure of myoglobin with the Heme 
group. The structurally most conserved residues are 
represented in color together with their side chains. 
These are: Leull5, Trpl4, Metl31, VallO, Vall3, 
Vall7, His24, Leu69, Leu76, Hisll9, Phel23, Alal34. 
The PDB code of the structure is la6g. The colour 
code represents temperature increasing from blue to 
red. 

We show a similar plot also for rubredoxin, a 
small bacterial protein involved in electron trans- 
port. In this case, we studied two homologous pro- 
teins, one from a mesophylic and one from a ther- 
mophylic bacterium. Their sequences have 57% 
similarity and belong to the same PFAM and FSSP 
classes. Although the structures are rather similar, 
the stability of the thermophylic protein, as mea- 
sured by the Z' and a parameters, is higher than 
the stability of the mesophylic protein, as it should 
be. This result supports our choice of the stability 
criteria. We compare the rigidities obtained from 
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FIG. 7. Comparison of rigidities between the two 
simulated neutral networks of mesophylic and thermo- 
phylic rubredoxin (top) and between simulated and ob- 
served rigidities for mesophylic rubredoxin (bottom). 
Full circles refer to the PFAM family, open diamonds 
to the FSSP family. 

Finally, we show in Fig. || a scatter plot of 
rigidities for the protein showing the worst correla- 
tion between predicted and observed rigidities: the 
TIM barrel, one of the most common folds, used 
for several enzymatic functions. The one that we 
studied is a triose phosphate isomerase function- 
ing in the glycolysis. In this case, there is also 
no observable correlation between rigidities in the 
PFAM and FSSP databases, and rigidities in the 



FSSP class are very low, in particular because sev- 
eral residues are deleted in many sequences. 
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FIG. 8. Rigidity in the FSSP database versus rigid- 
ity in the neutral network for the TIM barrel fold. The 
circles identify two positions of known functional role 
for one of the enzymes of the TIM barrel family. 

Unlike for other folds, in the case of ubiqui- 
tin the structurally conserved positions are dis- 
tributed along the main axis of the protein. There 
is also a conserved polar position on one loop. 
For lysozyme as well the most conserved residues 
form a non compact core. In all other cases the 
structurally conserved residues form a hydropho- 
bic cluster which is rather compact. There is some 
correlation between conservation and burial in the 
interior of the protein as measured by the number 
of contacts, but burial alone does not explain all of 
the conservation. We notice that also in our previ- 
ous lattice simulation the most conserved positions 
were those in the interior of the lattice structure 
(Bastolla et ai, 1999). 



DISCUSSION 

In this work we studied a neutral model of pro- 
tein evolution based on structure conservation. For 
all of the examined folds, local connectivities of 
neutral networks are broadly distributed. This re- 
sult implies that even in neutral evolution the num- 
ber of substitutions fluctuates more than a Poisso- 
nian variable, i.e. it is overdispersed. Therefore 
overdispersion can not by itself be used as a test 
for positive selection, as argued for instance by 
Gillespie (1991). Our results, nevertheless, show 
that the dispersion index of the SCN substitution 
process is typically small, rarely overcoming four. 



Thus, proteins with extremely high dispersion in- 
dex, as some of those studied by Gillespie (1989) or 
Ayala (1997) are not likely to have evolved in this 
way. The average substitution rate is almost con- 
stant in time, but it may vary considerably for dif- 
ferent evolutionary trajectories, corresponding to 
different branches of the phylogenetic tree. This 
fact should be taken into account when studying 
lineage effects such as the generation time effect 
(Britten, 1986; Li et ai, 1987; Gillespie, 1991). 

By simulating neutral evolution, we identi- 
fied structurally conserved positions and com- 
pared them to evolutionarily conserved positions 
in known protein families. The comparison is fa- 
vorable for myoglobin, cytochrome c, lysozyme, ri- 
bonuclease and rubrcdoxin, while for ubiquitin and 
the TIM barrel correlation between predicted and 
observed conservation is almost absent. We note, 
however, that the TIM barrel shows very little 
structural conservation, and the small size of the 
ubiquitin family makes the comparison not con- 
clusive. The plots comparing conservation in sim- 
ulated evolution (on the abscissa) to conservation 
in real evolution (on the ordinate) can be divided 
in four parts. In the upper left quadrant there are 
positions not conserved in the SCN model but con- 
served in evolution. We suggest that most of them 
are conserved for functional reasons or because of 
interactions with cofactors, which are not taken 
into account in our protein model. Positions of 
known functional importance belong to this class, 
but not enough is known on protein function to 
prove our interpretation in all cases. In the upper 
right quadrant there are positions conserved both 
in the SCN model and in the databases, whose 
conservation is likely to have a structural ground. 
For this small subset the rigidities that we predict 
are correlated to the observed ones. Interestingly, 
those positions form spatial clusters which have 
been identified with folding nuclei (Shakhnovich 
et ai, 1996; Ptitsyn, 1998; Mirny & Shakhnovich, 
2001). Although we can not discuss such interpre- 
tation, since our evolutionary algorithm docs not 
take into account folding kinetics, it is to be ex- 
pected that positions important for stability also 
play an important kinetic role. In the bottom left 
quadrant there are positions not conserved neither 
in the SCN model nor in the databases. These 
positions are likely to be the main actors in neu- 
tral evolution. Last, in the bottom right quad- 
rant there are few positions conserved according 
to the SCN model which do not appear to be evo- 
lutionarily conserved. Barring artifacts due to the 
SCN model, we should consider the possibility of 



conservation with low rigidity (but typically much 
higher than random). In order to verify whether 
this is the case, we need larger and less correlated 
protein classes. Another possibility is that these 
positions are frequently substituted because they 
can produce structural changes, possibly positively 
selected. Although this possibility is rather spec- 
ulative, it would be interesting to investigate it in 
more detail. 

Our results are based on an approximate sta- 
bility criterion relying on the Z-score (Bowie et 
al., 1991; Goldstein et al., 1992) and on a pa- 
rameter measuring the degree of correlation of the 
energy landscape (Bastolla et al, 1999). While 
such criterion may not be suitable for the quan- 
titative prediction of the thermodynamic stability 
of particular proteins, we believe that the statis- 
tical properties of the SCN model reflect those 
of actual protein evolution. This confidence has 
several grounds. First, in the study of a lattice 
model, three of us have previously applied a rigor- 
ous criterion of stability, and compared it to a cri- 
terion obtained from the Z-score (Bastolla et al., 
2000b). Although the two criteria give different 
responses for specific sequences, it is possible to 
choose a threshold such that most sequences se- 
lected with the Z-score criterion are also selected 
with the rigorous criterion. Second, the present 
results are robust with respect to changes in the 
selection thresholds and stability criteria. Third, 
wc: tested our stability parameters on a large 1 num- 
ber of sequences obtained from mutations of a TIM 
barrel enzyme, whose phenotypic effect has been 
experimentally measured in a recent paper (Silver- 
man et al., 2001). We found that, even if our cri- 
terion can not predict the effect of individual mu- 
tations, the latter is correlated to the a parameter 
with correlation coefficient 0.4. 

The present results can be used in the rational 
approach to directed evolution of biocatalysts (Al- 
tamirano et al., 2000; Voigt et al., 2001) since we 
identify sites that are more tolerant to mutations 
and therefore likely targets for evolutionary im- 
provement. This is a remarkable possibility, since 
it indicates how results based on the assumption of 
neutral evolution can be used to search for positive 
substitutions. 



MATERIALS AND METHODS 



Protein model 



We represent a protein structure by its contact 
matrix C'ij , where CV, = 1 if residues i and j are 
in contact and Cij = otherwise. Two residues 
are considered in contact if any two of their heavy 
atoms are closer than 4.5 A. The effective free en- 
ergy associated to a sequence of amino acids A in 
the configuration C is approximated as a sum of 
pairwise contact interactions, 



E(A,C) = J2Q j U(A i ,A j ), 



(3) 



Kj 



where Ai labels one of the twenty amino acid types 
and U(a, b) is a 20 x 20 symmetric interaction ma- 
trix. Here we use the matrix derived by Bastolla et 
al. (2000a), which describes accurately the ther- 
modynamic stability of a large set of monomeric 
proteins (Bastolla et ai, 2001). 

Three remarks are needed: First, the effective 
energy parameters implicitly take into account the 
effect of the solvent and depend on temperature. 
They express free energy rather than energy. Sec- 
ond, the effective energy of a structure is defined 
with respect to a completely extended reference 
structure where no contacts are formed and which 
sets the zero of the energy scale. Third, one 
can derive from the database not the parameters 
U(a, b) themselves but the dimensionless quanti- 
ties U(a, b)/ksT. It is thus important to use di- 
mensionless parameters to evaluate the stability of 
the protein model. 



Candidate structures 

We generate candidate structures for a protein 
sequence of N residues by generating all possible 
gaplcss alignments of the sequence with structures 
in the Protein Data Bank. This procedure is called 
threading. In this way, we typically generate sev- 
eral hundreds of thousands of protein-like struc- 
tures per sequence. In the present context, thread- 
ing is directly used to produce the contact maps of 
the candidate structures. In order to speed up the 
computations, we use a non redundant subset of 
the PDB excluding proteins with homologous se- 
quences, selected by Hobohm & Sander (1994). 
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The folding parameter a 

For a given sequence A, the energy landscape 
is well correlated if all configurations of low en- 
ergy are very similar to the configuration of min- 
imal effective energy, C*(A). Structure similar- 
ity is measured by the overlap q(C, C*), counting 
the number of contacts that two structures have 
in common and normalizing it through the max- 
imal number of contacts, so that q is comprised 
between zero and one. In a well correlated energy 
landscape, the inequality holds 



E(A,C) - E(A,C*) 
\E(A,C*)\ 



>a(A)(l-«(C,C*)) 



(4) 



stating that the energy gap between each alterna- 
tive structure C and the ground state C*, mea- 
sured in units of the ground state energy is larger 
than a quantity a (A) times the structural dis- 
tance 1 — q(C,C*). The dimensionless quantity 
a (A), which is the largest quantity for which the 
above inequality holds, can be used to evaluate the 
folding properties of sequence A. For random se- 
quences, many different configurations have quite 
similar energy and a(A) « 0. In this case the 
energy landscape is rugged, the folding kinetics is 
very slow and the thermodynamic stability with 
respect to variations in the solvent is very low. In 
contrast, computer simulations of well designed se- 
quences have shown that, when a(A) is finite, the 
folding kinetics is fast and the stability with re- 
spect to changes in the energy parameters as well 
as mutations in the sequence is very high. 

Our algorithm computes the parameter a (A) for 
a fixed target configuration C* and a large num- 
ber of sequences A. We thus indicate this param- 
eter as a(A, C*), since we do not know a priori 
that C* has lowest energy. Notice however that, 
if a(A, C*) is positive, all alternative structures 
have higher energy than C*. We impose that 
a(A, C*) is larger than a positive threshold athr 
for sequences A belonging to the neutral network. 



Z-score 

The Z-scorc Z(A, C*) (Bowie et al, 1991; Gold- 
stein et at, 1992) is a measure of the compatibility 
between a sequence A and a structure C*, widely 
used in structure prediction. It depends on an ef- 
fective energy function, and measures the differ- 
ence between the energy of sequence A in configu- 



ration C* and its average energy in a set of alterna- 
tive configurations, {C}, in units of the standard 
deviation of the energy: 



Z(A,C* 



E(A,C*)-(E(A,C)) 
y/(E(A,C)*)-(E(A,C))* 



(5) 



When sequence A folds in structure C* their 
corresponding Z score is very negative. 

Given the above definition, one has still to spec- 
ify how to select alternative structures. A pos- 
sibility, often used for lattice models (Mirny & 
Shakhnovich, 1996) is to assume that alternative 
structures are maximally compact, randomly cho- 
sen structures, whose average energy can be es- 
timated as (E(A,C))c = iVc max (e(A)). Here, 
Nc max is the maximal number of contacts of can- 
didate structures and (e(A)) is the average energy 
of a contact, averaged over all possible contacts 
formed by sequence A. This leads to introduce 
the parameter 



Z' 



E(A,C*)/Nc n 



(e(A)) 



V<e2(A)> - (e(A))2 



(6) 



The use of Z' has two main advantages: First, 
it makes the value of the Z-scorc much less sensi- 
tive to chain length N and to the particular set of 
alternative structures used; second, the evaluation 
of Z' is much faster than that of the Z-score. This 
is necessary in order to explore efficiently sequence 
space. 



Sampling the neutral network 

Our algorithm explores the neutral network of 
a given protein starting from its PDB sequence 
Ao and iterating the following procedure: At time 
step t, (i) The number of viable neighbors of se- 
quence At is computed; (ii) The sequence At+i is 
extracted at random among all the viable neigh- 
bors of A t . In this way we generate a stochastic 
process along the neutral network which simulate 
neutral evolution and looses memory of the initial 
sequence very fast. 

Sequence A is regarded as viable if both param- 
eters a(A, C*) and —Z'(A,C*) are above prede- 
termined thresholds, chosen as 98.5 percent of the 
values of those parameters for the sequence in the 
PDB. This enforces conservation of the thermody- 
namic stability and folding capability of the native 
structure C*. We verified that the observed be- 
havior does not change qualitatively for thresholds 
between 95% and about 100% of the PDB values. 
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We impose strict conservation of the cysteine 
residues in the PDB sequence, and do not allow 
any residue to mutate to cysteine, since a mutation 
changing the number of cysteine residues would 
leave the protein with a very reactive impaired 
cysteine that would probably affect its function- 
ality. Accordingly, the total number of neighbors 
tested is X to t — ^8(N—N cys ), where N is the num- 
ber of residues and iV cys is the number of cysteine 
residues in the starting sequence. 

The total number of viable point mutations, 
X(A), expresses the local connectivity of the neu- 
tral network. We normalize it by the total number 
of neighbors, X to t, getting the fraction of neutral 
neighbors, x(A) = X(A)/X tot e (0, 1]. 

To compute a; (A), we have to evaluate the a 
parameter for all sequences A' obtained through 
a point mutation of sequence A. From Eq.(|4|) 
we note that the a parameter can be obtained 
from the configuration with the highest destabiliz- 
ing power, i.e. the highest value of the energy gap 
divided by the structural distance from the native 
configuration. These change from sequence to se- 
quence, but it is expected not to change very much 
for neighboring sequences. Thus, in order to speed 
up the computation of a (A'), instead of consid- 
ering all candidate configurations we consider only 
the 50 configurations with the highest destabilizing 
power (i.e. the energy gap divided by the struc- 
tural distance from the native configuration) for 
sequence A and compute their mutated destabi- 
lizing power using sequence A'. The a parameter 
is then obtained from the configuration with the 
highest destabilizing power. This procedure could 
slightly overestimate a(A') since not all configura- 
tions are used, but we have checked that the error 
introduced in the x value is in all cases below 0.1%. 



Substitution process 

Given an evolutionary trajectory {xi,xi, • ••}, 
the distribution of the number of substitutions tak- 
ing place in a time T can be computed by consider- 
ing Eq.(Ty), where k, the number of attempted mu- 
tations, is a Poissonian variable of average value 

In order to handle the computation, we divide 
all values of Xj in M classes, choosing X a as repre- 
sentative value of all Xj's belonging to class a. The 
number of operations needed to evaluate the sub- 
stitution probability increases exponentially with 
the number of classes M. At the same time, the 
evaluation becomes more and more accurate as M 



increases. We chose M = 6 in our numerical com- 
putations as a reasonable compromise between ac- 
curacy and rapidity, checking that larger values of 
M introduce only small changes. 



Rigidity 

A measure of the conservation profile for a set 
of evolutionarily related sequences can be obtained 
measuring the rigidity of each position i, 



^) = E/»( fl ) 2 



(7) 



where fi(a), a = l,---20 is the frequency with 
which amino acid a is observed at position i, nor- 
malized so that ^2 a fi(a) = 1. Deletion of position 
i in a sequence is regarded formally as a 21st amino 
acid. A large rigidity R(i) means that position i 
is highly conserved. For unconstrained positions 
and in absence of deletions, fi(a) = 1/M, where 
M is the number of amino acids, and R(i) = 1/M. 
In general, rigidities are larger than 1/M because 
of the finite size of the sequence set and because 
sequences in the set are correlated due to com- 
mon evolutionary origin. Since cysteine residues 
are strictly conserved, we always get R(i) = 1 for 
them. Thus we omit these residues from the anal- 
ysis of conservation. 



PFAM and FSSP databases 

We compare the rigidity measured in the set 
of neutral sequences generated with the present 
method with the rigidity obtained from two 
databases: the PFAM database (Bateman et at, 
2000) and the FSSP database (Holm & Sander, 
1996). The PFAM database is a collection of fam- 
ilies of homologous sequences obtained by multi- 
ple alignment. Since multiple alignment methods 
work only for sufficiently high similarity, there are 
no sequences of low similarity in this database. 
The FSSP database is a collection of protein 
classes sharing the same fold (as determined by 
the program of structural alignment DALI (Holm 
&: Sander, 1996)). Since the structures must be 
experimentally known, the FSSP database is usu- 
ally smaller than the PFAM database. However it 
includes in the same class distant homologs whose 
evolutionary relationship can not be detected by 
means of sequence comparison alone. Due to 
database biases, many sequences in the PFAM 
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and FSSP databases are highly similar. To reduce 
this effect, proteins with similarity higher than a 
threshold g t hr = 0.85 to any other protein have 
been eliminated. 



ACKNOWLEDGMENTS 

UB thanks for interesting discussions Bill Eaton, 
Carlos Briones, Nikolai Dokholyan, Leo Mirny, 
Victor Parro, Burkhard Rost, Alfonso Valencia 
and Roeland Van Ham, and the Institut of Crys- 
tallography of the Free University of Berlin for lo- 
gistic help. UB is supported by INTA (Spain). MP 
is supported by the Max Planck Gesellschaft (Ger- 
many). HER is supported by INFN (Italy). MV 
is supported by EMBO and by the Royal Society 
(UK). 



issue, Nucl. Ac. Res. 28, 263-266. Databases avail- 



[1] Ayala, F.J. (1997) Vagaries of the molecular clock 
Proc. Natl. Acad. Set. USA 94: 7776-7783. 

[2] Altamirano, M.M., Blackburn, J.M., Aguayo C. 
& Fersht, A.-R. (2000) Directed evolution of new 
catalytic activity using the alpha/beta-barrel scaf- 
fold, Nature 403, 617-622. 

[3] Babajide, A., Hofacker, I.L., Sippl, M.J. & 
Stadler, P.F. (1997) Neutral networks in protein 
space, Fol. Des. 2, 261-269. 

[4] Bastolla, U. & Peliti, L. (1991), Un modele statis- 
tique d'evolution avec selection stabilisante, C. R. 
Acad. Sci. Paris, 313, Serie III, 101-105. 

[5] Bastolla, U., Farwer, J., Knapp, E.W. & Vendrus- 
colo, M. (2001) How to guarantee optimal stability 
for most representative structures in the protein 
data bank, Proteins 44, 79-96. 

[6] Bastolla, U., Roman, H.E. & Vendruscolo, M. 
(1999) Neutral evolution of model proteins: Dif- 
fusion in sequence space and overdispersion, J. 
Theor. Biol. 200, 49-64. 

[7] Bastolla, U., Vendruscolo, M. & Knapp, E.W. 
(2000a) A statistical mechanical method to op- 
timize energy functions for protein folding, Proc. 
Natl. Acad. Sci. USA 97, 3977-3981. 

[8] Bastolla, U., Vendruscolo, M. & Roman, H.E. 
(2000b) Structurally constrained protein evolu- 
tion: Results from a lattice simulation, Eur. Phys. 
J. B 15, 385-397. 

[9] Bateman, A., Birney, E., Durbin, R., Eddy, S.-R., 
Howe, K.-L. & Sonnhammer, E.-L.-L. (2000) The 
PFAM contribution to the annual NAR database 



able at tittp://pf am. wustl.edu/. 

[10] Bornberg-Bauer, E. (1997) How are model protein 
structures distributed in sequence space?, Bio- 
phys. J. 73, 2393-2403; Bornberg-Bauer, E. & 
Chan, H.S. (1999) Modeling evolutionary land- 
scapes: Mutational stability, topology, and super- 
funnels in sequence space, Proc. Natl. Acad. Sci. 
USA 96, 10689-10694. 

[11] Bowie, J.U., Liithy, R. & Eisenberg, D. (1991) A 
method to identify protein sequences that fold into 
a known 3-dimensional structure, Science 253, 
164-170. 

[12] Britten R.J. (1986) Rates of DNA sequence evo- 
lution differ between taxonomic groups, Science 
231, 1393-1398. 

[13] Bussemaker, H.J., Thirumalai, D. & Bhattachar- 
jee, J.K. (1997) Thermodynamic stability of folded 
proteins against mutations, Phys. Rev. Lett. 79, 
3530-3533. 

[14] Cecconi, F., Micheletti, C, Carloni, P. & Maritan, 
A. (2001) Molecular dynamics studies on HIV-1 
protease, Proteins 43, 365-372. 

[15] Dokholyan, N.V. & Shakhnovich, E.I. (2001) Un- 
derstanding hierarchica l protein evolution from 
first principles, preprint cond-mat/0104469 , 

[16] Fontana, W. & Schuster, P. (1998) Continuity in 
evolution: On the nature of transitions, Science 
280, 1451-1455. 

[17] Gerstein, M. (1998) Patterns of protein-fold us- 
age in eight microbial genomes: A comprehensive 
structural census, Proteins 33, 518-534. 

[18] Gillespie, J.H. (1989) Lineage effects and the index 
of dispersion of molecular evolution, Mol. Biol. 
Evol. 6, 636-647. 

[19] Gillespie, J.H. (1991) The causes of molecular evo- 
lution, Oxford University Press. 

[20] Goldstein, R.A., 

Luthey-Schulten ZA. & Wolynes, PG. (1992) Op- 
timal protein-folding codes from spin-glass theory, 
Proc. Natl. Acad. Sci. USA 89, 4918-4922. 

[21] Govindarajan, S. & Goldstein, RA. (1997) The 
foldability landscape of model proteins, Biopoly- 
mers 42, 427-438. 

[22] Govindarajan, S. & Goldstein, R.A. (1998) On 
the thermodynamic hypothesis of protein folding, 
Prod. Natl. Acad. Sci. USA 95, 5545-5549. 

[23] Henikoff, S., Greene, E.-A., Pietrokovski, S., Bork, 
P., Attwood, T.-K. & Hood, L. (1997) Gene 
families: The taxonomy of protein paralogs and 
chimeras, Science 278, 609-614. 

[24] Hobohm, U. & Sander, C. (1994) Enlarged repre- 
sentative set of protein structure, Protein Sci. 3, 
522-524. 

[25] Holm, L. & Sander, G (1996) Mapping the 
protein universe, Science 273, 595-602 and 



13 



http : //www2 . ebi . ac , uk/dali/f ssp/| . [42] 

[26] Huynen, M.A., Stadlcr, P.F. & Fontana, W. 
(1996) Smoothness within ruggedness: The role 
of neutrality in adaptation, Proc. Natl. Acad. Sci. 
USA 93, 397-401. [43] 

[27] Kannan, N. & Vishveshwara, S. (1999) Identifica- 
tion of side-chain clusters in protein structures by [44] 
a graph spectral method, J. Mol. Biol. 292, 441- 
464; (2000) Aromatic clusters: A determinant of 
thermal stability of thermophylic proteins, Prot. 
Eng. 13, 753-761. [45] 

[28] Keefe, A.-D. & Szostak, J.W. (2001) Functional 
proteins from a random sequence library, Nature 
410, 715-718. [46] 

[29] Kimura, M. (1968) Evolutionary rate at the molec- 
ular level, Nature 217, 624-626. 

[30] Kimura, M. (1977) Preponderance of synonimous 
changes as evidence for the neutral theory of 
molecular evolution. Nature 267, 275-6. 

[31] King, J.-L. & Jukes, T.H. (1969) Non-Darwinian [47] 
evolution, Science 164, 788-798. 

[32] Koonin, E.-V., Aravind, L. & Kondrashov, A.-S. 

(2000) The impact of comparative genomics on [48] 
our understanding of evolution, Cell 101, 573-576. 

[33] Langley, C.-H. & Fitch, W.M. (1973) An estima- [49] 
tion of the constancy of the rate of molecular evo- 
lution, J. Mol. Evol. 3, 161-177 

[34] Li, W.H., Tanimura, M. & Sharp, P.M. (1987) An 

evaluation of the molecular clock hypothesis us- [50] 
ing mammalian DNA sequences, J. Mol. Evol. 25, 
330-342. 

[35] Mirny, L. & Shakhnovich, E.I. (1996) How to de- [51] 
rive a protein folding potential? A new approach 
to an old problem, J. Mol. Biol. 264, 1164-1179. 

[36] Mirny, L.A. & Shakhnovich, E.I. (1999) Univer- 
sally conserved positions in protein folds: Reading [52] 
evolutionary signals about stability, folding kinet- 
ics, and function, J. Mol. Biol. 291, 177-196. 

[37] Mirny, L. & Shakhnovich, E. (2001) Evolutionary 
conservation of the folding nucleus, J. Mol. Biol. 
308, 123-129. 

[38] Orencia, M.-C, Yoon, J.-S., Ness, J.-E., Stem- 
mer, W.-P & Stevens, R.-C. (2001) Predicting the 
emergence of antibiotic resistance by directed evo- 
lution and structural analysis, Nat. Struct. Biol. 8, 
238-242. 

[39] Ohta, T. (1976) Role of very slightly deleterious 
mutations in molecular evolution and polymor- 
phism, Theor. Pop. Biol. 10, 254-275. 

[40] Ohta, T. & Kimura, M. (1971) On the constancy 
of the evolutionary rate of cistrons, J. Mol. Evol. 
1, 18-25. 

[41] Ptitsyn, O.B. (1998) Protein folding and protein 
evolution: Common folding nucleus in different 
subfamilies of c-type cytochrome? J. Mol. Biol. 
278, 655-666. 



Ptitsyn, O.B. & Ting, K.H. (1999) Non-functional 
conserved residues in globins and their possible 
role as a folding nucleus, J. Mol. Biol. 291, 671- 
682. 

Rost, B. (1997) Protein structures sustain evolu- 
tionary drift, Fol. Des. 2, S19-S24. 
Schuster, P., Fontana, W., Stadler, P.F. & Ho- 
facker, I.L. (1994) From sequences to shapes and 
back - A case-study in RNA secondary structures, 
Proc. R. Soc. London B 255, 279-284; 
Shakhnovich, E., Abkevich, V. & Ptitsyn, O. 
(1996) Conserved residues and the mechanism of 
protein folding, Nature 379, 96-98. 
Shakhnovich, E.I. & Gutin, A.M. (1993) Engineer- 
ing of stable and fast-folding sequences of model 
proteins, Proc. Natl. Acad. Sci. USA 90, 7195- 
7199; Shakhnovich E.I. (1994) Proteins with se- 
lected sequences fold into unique native confor- 
mation, Phys. Rev. Lett. 72, 3907-3910. 
Silverman, J. A., Balakrishan, R. & Harbury, P.B. 
(2001) Reverse engineering the (/3/a)s barrel fold, 
Proc. Natl. Acad. Sci. USA 98, 3092-3097. 
Takahata, N. (1987) On the overdispersed molec- 
ular clock, Genetics 116, 169-179. 
Tiana, C, Broglia, R.A., Roman, H.E., Vigezzi, 
E. & Shakhnovich, E.I. (1998) Folding and mis- 
folding of designed proteinlike chains with muta- 
tions, J. Chem. Phys. 108, 757-761. 
Thornton, J.M., Orengo, C.A., Todd, A.E. & 
Pearl, F.M.G. (1999) Protein folds, functions and 
evolution J. Mol. Biol. 293, 333-342. 
Voigt, C.A., Mayo, ST., Arnold, F.H. & Wang, 
Z.G. (2001) Computational method to reduce the 
search space for directed protein evolution, Proc. 
Natl. Acad. Sci. USA 98, 3778-3783. 
Zuckerkandl, E. & Pauling, L. (1962), in Horizons 
in Biochemistry, eds. M. Kasha and B. Pullman 
(Academic Press, New York) . 



11 



Protein 


PDB id. 


N 


-Z 


a 


{x) 


a{x) 


T 


rubredoxin (m.) 


liro 


53 


0.357 


0.361 


0.634 


0.184 


1.8 


rubredoxin (th.) 


lbrf 


53 


0.455 


0.405 


0.619 


0.175 


2.0 


cytochrome c 


451c 


82 


0.403 


0.462 


0.548 


0.193 


2.2 


ribonuclease 


7rsa 


124 


0.410 


0.424 


0.666 


0.188 


2.2 


lysozyme 


31zt 


129 


0.355 


0.512 


0.628 


0.195 


2.8 


myoglobin 


la6g 


151 


0.458 


0.575 


0.599 


0.191 


2.4 


ubiquitin 


lu9aA 


160 


0.402 


0.568 


0.631 


0.195 


2.4 


TIM barrel 


7timA 


247 


0.377 


0.795 


0.656 


0.192 


2.4 



TABLE I. Summary of the seven neutral networks studied. For rubredoxin, m. and th. stand for the mesophylic 
and thermophylic form respectively, x indicates the fraction of neutral neighbors and r is the correlation length 
of x along an evolutionary trajectory obtained from the stretched exponential decay of the correlation function. 



Abbreviations: PDB Protein Data Bank, SNC Structurally Constrained Neutral Model, FSSP Fold 
classification based on Structure-Structure alignment of Proteins. 
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